On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers
نویسندگان
چکیده
Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. In this paper we introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model without the limitation of the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques like messages, and built in hardware mechanisms. The structure of the implemented algorithm is presented, and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular the error detection mechanism by messages, on the application performance.
منابع مشابه
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to design of fault tolerant computing systems. In this paper, a technique is employed that enable the combination of several codes, in order to obtain flexibility in the design of error correcting codes. Code combining techniques are very effective, which one of these codes are turbo codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
متن کاملAn Event-driven Approach to Multiprocessor Diagnosis
For constructing fault tolerance mechanisms in large massively parallel multiprocessor systems, a scalable fault diagnosis is necessary, which works efficiently even if there are several thousand processors in the system. In this paper we present an event-driven, distributed system-level diagnosis algorithm, based on a general diagnosis model which does not limit the number of simultaneously ex...
متن کاملA Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems
A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the rst attempt at providing a purely software-based, user-level solution for fault detection, reconnguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...
متن کاملAn Approach for Hierarchical System Level Diagnosis of Massively Parallel Computers Combined with a Simulation-Based Method for Dependability Analysis
The primary focus in the analysis of massively parallel supercomputers has traditionally been on their performance. However, their complex network topologies, large number of processors, and sophisticated system software can make them very unreliable. If every failure of one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Theref...
متن کاملReversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs
Reversible logic is one of the new paradigms for power optimization that can be used instead of the current circuits. Moreover, the fault-tolerance capability in the form of error detection or error correction is a vital aspect for current processing systems. In this paper, as the multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...
متن کامل